docs: add doc to recover from pod from lost node #8742
Conversation
### Shorten the timeout

To shorten the timeout, you can mark the node as "blacklisted" so Rook can safely failover the pod sooner.
In case the node is just offline and there is no watcher active, we need to just blacklist the whole node, rather than blacklist just a session id, right? Seems like we could simplify this section to just blacklist the node ip.
Response? I don't understand why we would want to blacklist only a session id instead of always blacklisting the whole node. The point is also to prevent the node from coming back online and creating a new session, right?
@Madhu-1 can you help here? Thanks
We need to blacklist the IP as we want to block all sessions of that node.
Ok, so if we want to blacklist all sessions, we only need the node ip, right? And no need to get the PV session IDs?
Yes we just need the Node IP to blacklist it.
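A minimal sketch of that flow, assuming a hypothetical lost node named `lost-node`: read the node's `InternalIP` from Kubernetes, then blacklist it from the [Rook toolbox](ceph-toolbox.md). (On Ceph Pacific and newer the subcommand is named `blocklist` rather than `blacklist`.)

```console
$ NODE_IP=$(kubectl get node lost-node -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
$ ceph osd blacklist add $NODE_IP
```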
Force-pushed from 9bcd369 to 2b96b94
Start the sentence with a capital letter in your commit message.
Force-pushed from 2b96b94 to 514a2ba
Force-pushed from 514a2ba to 2bf5381
Force-pushed from 2bf5381 to 123fe41
Force-pushed from 123fe41 to e6331b4
```console
$ PV_NAME= # enter pv name
$ IMAGE=$(kubectl get pv $PV_NAME -o jsonpath='{.spec.csi.volumeHandle}' | cut -d '-' -f 6- | awk '{print "csi-vol-"$1}')
$ echo $IMAGE
```

The solution is to remove the watcher, following the commands below from the [Rook toolbox](ceph-toolbox.md):

```console
$ rbd status <image> --pool=<pool name> # get image from above output
```
> ```
> Watchers:
>     watcher=10.130.2.1:0/2076971174 client.14206 cookie=18446462598732840961
> ```

```console
$ ceph osd blacklist add 10.130.2.1:0 # to know which watcher to block see above output
blacklisting 10.130.2.1:0
```
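The `cut`/`awk` pipeline above turns a CSI volumeHandle into an RBD image name. As a self-contained sanity check of just that parsing step (the handle and its UUID below are made up for illustration, not from a real cluster):

```shell
# Made-up volumeHandle; the clusterID part ("rook-ceph") itself contains a dash,
# which is why the image UUID starts at field 6 when splitting on '-'
HANDLE="0001-0009-rook-ceph-0000000000000001-89c94f98-1234-11ec-aabb-0242ac110002"
# Keep fields 6 and onward, then prefix with "csi-vol-" to get the image name
IMAGE=$(echo "$HANDLE" | cut -d '-' -f 6- | awk '{print "csi-vol-"$1}')
echo "$IMAGE"
```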
@travisn before I make changes, just to confirm: I'll remove the above part and will just ask the user to get the node IP (of the node which is lost) and blacklist that.
And if the Ceph version is above Octopus we'll use `ceph osd blacklist`, else `ceph osd blocklist`.
If the Ceph version is Pacific and above, use `blocklist`; else use `blacklist`.
right, thanks
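A hedged sketch of picking the command name from the Ceph version string (the sample output line is illustrative; on a live cluster you would set `VERSION_STR=$(ceph version)` instead):

```shell
# Illustrative `ceph version` output; Pacific is major version 16
VERSION_STR="ceph version 16.2.6 (abc123) pacific (stable)"
# The third whitespace-separated field is the numeric version; take the major part
MAJOR=$(echo "$VERSION_STR" | awk '{print $3}' | cut -d '.' -f 1)
if [ "$MAJOR" -ge 16 ]; then
  BLOCK_CMD="ceph osd blocklist add"   # Pacific (v16) and newer
else
  BLOCK_CMD="ceph osd blacklist add"   # Octopus (v15) and older
fi
echo "$BLOCK_CMD"
```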
Correct, we're just blocking the node, rather than a session id.
Force-pushed from e6331b4 to 815b31a
This pull request has merge conflicts that must be resolved before it can be merged. @subhamkrai please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork
Force-pushed from 815b31a to 29b1391
@travisn ^^^
Force-pushed from 29b1391 to f089198
Force-pushed from f089198 to 4e1906f
This commit adds the doc which has the manual steps to recover from the specific scenario like `on the node lost, the new pod can't mount the same volume`.

Closes: rook#1507
Signed-off-by: subhamkrai <srai@redhat.com>
Force-pushed from 4e1906f to 7587704
docs: add doc to recover from pod from lost node (backport #8742)

This commit adds the doc which has the manual steps to recover from the specific scenario like `on the node lost, the new pod can't mount the same volume`.

Closes: #1507
Signed-off-by: subhamkrai <srai@redhat.com>
Description of your changes:

Which issue is resolved by this Pull Request:

Resolves #1507

Checklist:
- [ ] Code generation (`make codegen`) has been run to update object specifications, if necessary.